Abstract. We present the first in-place algorithm for sorting an array of size $n$ that
performs, in the worst case, at most $O(n \log n)$ element comparisons and $O(n)$ element
transports.

This solves a long-standing open problem, stated explicitly, e.g., in [J.I. Munro and
V. Raman, Sorting with minimum data movement, J. Algorithms, 13, 374-93, 1992], of
whether there exists a sorting algorithm that matches the asymptotic lower bounds on
all computational resources simultaneously.

1. Introduction



From the very beginnings of computer science, sorting has been one of the most fundamental
problems, of great practical and theoretical importance. In virtually every field of computer
science there are problems that have the sorting of a set of objects as a primary step toward
a solution. (For the early history of sorting, see Sect. 5.5].) It is well known that a
comparison-based algorithm must perform, in the worst case, at least
$\lceil \log n! \rceil \ge n \log n - n \log e \approx n \log n - 1.443n$ comparisons to sort
an array consisting of $n$ elements. (All logarithms throughout this paper are to the base 2,
unless otherwise stated explicitly.) By ^F3\, the corresponding lower bound for element moves
is $\lfloor 3/2 \cdot n \rfloor$.
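The comparison lower bound above can be checked numerically. The following is a minimal Python sketch; the function names are ours and are not part of the paper. It computes $\lceil \log n! \rceil$ exactly with integer arithmetic and evaluates the estimate $n \log n - n \log e$ for comparison.

```python
import math

def comparison_lower_bound(n: int) -> int:
    """Return ceil(log2(n!)), the worst-case lower bound on comparisons."""
    f = math.factorial(n)          # exact integer, no rounding error
    # For every integer f >= 1, ceil(log2(f)) equals (f - 1).bit_length().
    return (f - 1).bit_length()

def stirling_estimate(n: int) -> float:
    """Return n*log2(n) - n*log2(e), the estimate quoted in the text."""
    return n * math.log2(n) - n * math.log2(math.e)
```

For small $n$ the exact value stays above the estimate, and the gap grows only slowly with $n$.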

Concerning upper bounds for the number of comparisons, already the plain version
of mergesort gets close to the optimum, with at most $n \cdot \lceil \log n \rceil - n + 1$
comparisons. However, this algorithm also needs an auxiliary array for storing $n$ elements,
so it is not an in-place algorithm. That is, it does not work with only a constant amount of
auxiliary storage besides the data stored in the input array. In-place algorithms play an
important role, because they maximize the size of data that can be processed in the main
memory without access, during the computation, to a secondary storage device.

The rich history of the comparisons-storage family of sorting algorithms, using $O(n \log n)$
comparisons and, at the same time, $O(1)$ auxiliary storage, begins with a binary-search
version of insertsort. This algorithm uses fewer than $\log n! + n$ comparisons, only a single
storage location for putting elements aside, and only $O(1)$ index variables, of $\log n$ bits
each, for pointing to the input array. Unfortunately, the algorithm performs $\Omega(n^2)$
element moves, which makes it unacceptably slow as $n$ increases.
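To make the trade-off concrete, here is a small Python sketch of binary-search insertsort that counts comparisons and element moves. It is an illustration under our own conventions (for instance, lifting an element aside and dropping it back each count as one move), not code from the paper.

```python
def binary_insertsort(a):
    """Sort list a in place; return (comparisons, moves)."""
    comparisons = moves = 0
    for i in range(1, len(a)):
        x = a[i]                      # put the element aside: one move
        moves += 1
        lo, hi = 0, i                 # binary search for the slot in a[0:i]
        while lo < hi:
            mid = (lo + hi) // 2
            comparisons += 1
            if a[mid] <= x:
                lo = mid + 1
            else:
                hi = mid
        for j in range(i, lo, -1):    # shift the tail one step to the right
            a[j] = a[j - 1]
            moves += 1
        a[lo] = x                     # drop the element into place: one move
        moves += 1
    return comparisons, moves
```

On a reversed input every insertion shifts the whole sorted prefix, so the move count grows quadratically while the comparison count stays near $n \log n$.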

Heapsort was the first in-place sorting algorithm with a total running time
bounded by $O(n \log n)$ in the worst case. More precisely, it uses fewer than $2n \log n$
comparisons with the same $O(1)$ storage requirements as insertsort, but only
$n \log n + O(n)$ moves, if the moves are organized a little bit carefully. Since then, many
cloned versions of heapsort have been developed; the two most important ones are
bottom-up-heapsort and a $\log^*$-variant. Both these variants use not only the same number of
moves as the standard heapsort, but even exactly the same sequence of element moves for each
input. (See also the procedure "shiftdown" in jlfij.) However, they differ in the number of
comparisons. Though the bottom-up variant uses only $3/2 \cdot n \log n + O(n)$ comparisons,
its upper bound for the average case is even more important; with $n \log n + O(n)$
comparisons, it is one of the most efficient in-place sorting algorithms. The $\log^*$-variant
is slightly less efficient on average, but it guarantees fewer than
$n \log n + n \log^* n$ comparisons in the worst case. For a more detailed analysis, see
also [TU1 ITT)].

Then in-place variants of a $k$-way mergesort came to the scene (HI E], with at most
$n \log n + O(n)$ comparisons, $O(1)$ auxiliary storage, and
$\varepsilon \cdot n \log n + O(n)$ moves. Instead of merging only 2 blocks, $k$ sorted blocks
are merged together at the same time. Here $k$ denotes an arbitrarily large, but fixed, integer
constant, and $\varepsilon > 0$ an arbitrarily small, but fixed, real constant. Except for the
first extracted element in each $k$-tuple of blocks, the smallest element is found with
$\log k$ comparisons, if $k$ is a power of two, since the $k$ currently leftmost elements of
the respective blocks are organized into a selection tree. Though $\log k$ is more than the one
comparison required in standard 2-way merging, the number of merging sweeps across the array
comes down to $\lceil \log n / \log k \rceil$, so the number of comparisons is almost
unchanged. As an additional bonus, the number of element moves is reduced if, instead of
elements, only pointers to elements are swapped in the selection tree. By the use of some
other tricks, the algorithm is made in-place and the size of auxiliary storage is reduced
to $O(1)$. The early implementation of this algorithm, despite such promising upper bounds,
turned out to be unacceptably slow. It was observed that operations with indices representing
the current state of the selection tree became a bottleneck of the program. Fortunately,
the state of a selection tree with a constant number of leaves can be represented implicitly,
without swapping indices. This indicates that even by summing comparisons and moves we
do not get the whole truth; the arithmetic operations with indices are also important.

The $k$-way variant has been generalized to a $(\log n / \log\log n)$-way in-place
mergesort (Zj. This algorithm uses $n \log n + O(n \log\log n)$ comparisons, $O(1)$ auxiliary
storage, and only $O(n \log n / \log\log n)$ element moves. Since $k$ is no longer a constant
here, the information about the selection tree is compressed, among other things, into bits of
$(\log n)$-bit index variables by complicated bitwise operations, which increases the number
of arithmetic operations. Therefore, the algorithm is mainly of theoretical interest; it is
the first member of the comparisons-storage family breaking the bound $O(n \log n)$ for the
number of moves.

The transports-storage family of algorithms, sorting with $O(n)$ element moves and $O(1)$
auxiliary storage, is not so numerous. The first algorithm of this type is selectsort, which
is a natural counterpart of insertsort. Carefully implemented, it sorts with at most $2n - 1$
moves, a single location for putting one element aside, and $O(1)$ index variables.
Unfortunately, it also performs $\Omega(n^2)$ comparisons.

As shown in ^3], $O(n^2)$ comparisons and $O(1)$ indices suffice for reduction of the number
of moves to the lower bound $\lfloor 3/2 \cdot n \rfloor$.

Another improvement is a generalized heapsort ,11 i: it is based on a heap in which
internal nodes have $\lfloor n^{1/k} \rfloor$ children, for a fixed integer $k$. The
corresponding heap tree is thus of constant height, which results in an algorithm sorting with
$O(n)$ moves, $O(1)$ storage, and $O(n^{1+\varepsilon})$ comparisons.

Finally, consider the comparisons-transports family, sorting with $O(n \log n)$ comparisons
and $O(n)$ element moves. The first member is the so-called tablesort. We use any
algorithm with $O(n \log n)$ comparisons but, instead of elements, we move only indices
pointing to the elements. When each element's final position has been determined, we
transport all elements to their destinations in linear time. However, this algorithm requires
$O(n)$ auxiliary indices.
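The tablesort idea can be sketched in a few lines of Python. This is an illustrative version only: Python's built-in `sorted` stands in for "any algorithm with $O(n \log n)$ comparisons", and the function name and the move-counting convention are our assumptions.

```python
def tablesort(a):
    """Sort list a in place via an index table; return the number of element moves."""
    n = len(a)
    # table[i] is the index of the element that belongs at position i.
    # Only indices are rearranged in this step, never the elements themselves.
    table = sorted(range(n), key=lambda i: a[i])
    moves = 0
    for start in range(n):
        if table[start] == start:
            continue                  # already settled
        saved = a[start]              # lift one element out, creating a hole
        moves += 1
        i = start
        while table[i] != start:
            src = table[i]
            a[i] = a[src]             # fill the hole from its source position
            moves += 1
            table[i] = i              # position i is now final
            i = src
        a[i] = saved                  # close the cycle
        moves += 1
        table[i] = i
    return moves
```

Each permutation cycle of length $L$ costs $L + 1$ moves, so the total stays linear in $n$; the price is the table of $n$ indices, which is exactly the auxiliary storage the text refers to.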

The storage requirements have been reduced to $O(n^{\varepsilon})$ by a variant of
samplesort jllj. The same result can also be obtained by the in-place variant of the $k$-way
mergesort [HI El mentioned above, if $k = \lceil n^{\varepsilon} \rceil$. This reduces the
number of merging sweeps down to a constant, which results in $O(n \log n)$ comparisons and
$O(n)$ element moves. Such a modification is no longer in-place: it uses $O(n^{\varepsilon})$
auxiliary indices to represent a selection tree. We leave the details to the reader.

So far, there has been no known algorithm sorting, in the worst case, with $O(n \log n)$
comparisons, $O(n)$ moves, $O(1)$ auxiliary storage, and, at the same time, $O(n \log n)$
arithmetic operations.

This ultimate goal has only been achieved in the average case |llj. In the worst case,
that algorithm uses $\Omega(n^2)$ comparisons but, for a randomly chosen permutation of input
elements, the probability of this worst case scenario is negligible.

It was generally conjectured, for many years, that an algorithm matching simultaneously
the asymptotic lower bounds on all the above computational resources does not exist. For
example, in [T3|, it was proved that the algorithm with $O(n^{1+\varepsilon})$ comparisons
using generalized heaps is optimal among a certain restricted family of in-place sorting
algorithms performing $O(n)$ moves. It was hoped that, by generalizing from a restricted
computational model to all comparison-based algorithms, we could get a higher trade-off among
comparisons, moves, and storage.

1.1. Our result. The result we shall present in this paper contradicts the above conjectures
and closes a long-standing open problem. We shall exhibit the first sorting algorithm of
the type comparisons-transports-storage. Our algorithm operates in-place, with at most
$2n \log n + o(n \log n)$ element comparisons and $(13+\varepsilon) \cdot n$ element moves in
the worst case, for each $n \ge 1$. Here $\varepsilon > 0$ denotes an arbitrarily small, but
fixed, real constant. The number of auxiliary arithmetic operations with indices is bounded by
$O(n \log n)$. We can slightly reduce the number of moves, to $(12+\varepsilon) \cdot n$, in a
modified version that uses $6n \log n + o(n \log n)$ comparisons.

The algorithm was born as a union of the ideas contained in two independent technical
reports, [U|3|. We believe that, besides the theoretical breakthrough achieved by its analysis,
the algorithm can also be of practical interest, because of its simplicity.

1.2. Algorithm in a nutshell. Using an evenly distributed sample $a_1, \ldots, a_f$ of size
$\Theta(n/(\log n)^4)$, split the elements into segments $\sigma_0, \sigma_1, \ldots, \sigma_f$,
of length $\Theta((\log n)^4)$ each, so that elements $a$ in $\sigma_k$ satisfy
$a_k \le a \le a_{k+1}$. The sorted array is obtained by forming
$\sigma'_0, a_1, \sigma'_1, a_2, \sigma'_2, \ldots, a_f, \sigma'_f$, where $\sigma'_k$ denotes
$\sigma_k$ in sorted order. To sort $\sigma_k$, use a modified heapsort, with internal nodes
having $\Theta((\log n)^{4/5})$ sons, which results in a constant number of moves per each
element extracted from the heap.

Since an evenly distributed sample is hard to find, it grows dynamically: when some
$\sigma_k$ becomes too large, halve it into two segments of equal length, and insert the median
in the sample. To minimize moves required for insertions in the sample, it is sparsely
distributed in a block of size $\Theta(n/(\log n)^3)$, without losing the advantage of a quick
binary search. A local density of elements is eliminated by redistributing the sample more
evenly, which does not happen "too often." To avoid the corresponding segment movement, only
pointers connecting the $a_k$'s with the $\sigma_k$'s are moved; the segments stay motionless
in a separate workspace.

However, we do not have a buffer of size $3n$, required for the sample and the segments,
nor $P = \Theta(n/(\log n)^2)$ bits, for pointers. The bits are "created" at the very beginning
by a modified heapsort, collecting the smallest and the largest $P$ elements to blocks $\Pi_L$
and $\Pi_R$, which leaves a block $A'$ in between. Then the $j$th bit can be encoded by
swapping the $j$th element in $\Pi_L$ with the $j$th element in $\Pi_R$.

To "create" a buffer for sorting the block $A'$ of length $n'$, select the element $b^{\sim}$
of rank $\lfloor n'/4 \rfloor$ and partition $A'$ into blocks $A_{<}$ and $B_{\ge}$, using
$b^{\sim}$ as a pivot. Then sort $A_{<}$ using $B_{\ge}$ as an empty buffer. (We can test
whether a given location contains a buffer element by a single comparison with $b^{\sim}$.
Before an "active" element is moved, one buffer element escapes to the current location of the
hole.) After sorting $A_{<}$ we iterate, focusing on $B_{\ge}$ as a new block $A'$. After
$O(\log n)$ iterations, we are done.

2. Sorting with an Additional Memory 

Before presenting our in-place algorithm, we shall concentrate on a simpler task. We are
going to sort a given contiguous block $A$, consisting of $m$ elements, using only
$O(m \log m)$ comparisons and $O(m)$ element moves. As additional resources, we are given a
buffer memory, of size at least $3m - 1$, that can be used as a temporary workspace, and a
pointer memory, capable of containing at least $\lfloor 4m/(\log m)^2 \rfloor$ bits.

To let the elements move, we also have a hole, that is, one location, the content of which
can be modified without destroying any element. An assignment $a_j := a_i$ transports not only
one element from the location $i$ to $j$, but also the hole from $j$ to $i$. At the very
beginning, the hole is in a single extra location, besides the given input array.

2.1. Buffer memory. The buffer memory forms a separate contiguous block $B$, initially
consisting of at least $3m - 1$ buffer elements. All buffer elements are greater than or equal
to a given buffer separator $b^{\sim}$, placed in an extra location, while all elements in $A$
are strictly smaller than $b^{\sim}$. During the computation, the elements of $A$ and $B$ are
mixed up. However, by a single comparison with $b^{\sim}$, we can test whether any given
location contains a buffer element, or an active element, a subject of sorting, placed
originally in $A$.

The buffer memory $B$ consists of two parts. First, there is a low level segment memory, a
sequence of segments allocated dynamically from the right end of $B$ and growing to the left,
as the computation demands. All allocated segments are of the same fixed length. Second,
there is a fixed high level frame memory, placed at the left end of $B$.

2.2. Structure of the segment memory. All segments are of a fixed length $s$, where

$$ s = \begin{cases} \lceil (\log m)^4 \rceil, \\ \lceil (\log m)^4 \rceil + 1, \end{cases} \quad \text{so that $s$ is odd.} \tag{1} $$

During the computation, the number of active segments never exceeds $s^{\#}$, defined by

$$ s^{\#} = \lfloor 2m/s \rfloor \le 2m/(\log m)^4, \tag{2} $$

and hence the size of workspace reserved for the segment memory is bounded by

$$ S = s^{\#} \cdot s \le 2m. \tag{3} $$

Here we assume that $m$ is "sufficiently large," such that $s \le m$, and hence
$s^{\#} \ge 2$. We shall later discuss how to handle a block $A$ that is "short."

Initially, all segments are free, containing buffer elements only. The algorithm keeps the
starting position of the last segment that has been allocated in a global index variable,
written here as $\bar{s}$ to distinguish it from the segment length $s$. Initially, $\bar{s}$
points to the right end of the buffer memory $B$. To allocate a new segment, the procedure
simply performs the operation $\bar{s} := \bar{s} - s$, and returns the new value of $\bar{s}$
as the starting position of the new segment. Immediately after allocation, some
$\lfloor s/2 \rfloor$ active elements (smaller than $b^{\sim}$) are transported to the first
$\lfloor s/2 \rfloor$ positions of the new segment. The corresponding buffer elements are saved
in the locations released by the active elements. From this point forward, the segment becomes
active.

In general, the structure of an active segment is $c_1 \ldots c_h b_{h+1} \ldots b_s$, where
$c_1 \ldots c_h$ are active elements stored in the segment, while $b_{h+1} \ldots b_s$ are
some buffer elements. The value of $h$ is kept between $\lfloor s/2 \rfloor$ and $s - 1$, so
that at least one half (roughly) of the elements in each active segment are active, and there
is still room for storing one more active element. Neither $c_1 \ldots c_h$ nor
$b_{h+1} \ldots b_s$ are sorted. In addition, the algorithm does not keep any information
about the boundary $h$ separating active and buffer elements, if the segment is not being
manipulated at the present moment. However, since all active elements are strictly smaller
than $b^{\sim}$ and all buffer elements are greater than or equal to $b^{\sim}$, we can quickly
determine the number of active elements in any given segment, using a binary search with
$b^{\sim}$ over the $s$ locations of the segment, which costs only
$1 + \lfloor \log s \rfloor \le O(\log\log m)$ comparisons, by (1).
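The boundary search just described works even though neither part of the segment is sorted, because the test "is this element smaller than the separator?" is true on a prefix and false on the rest. A minimal Python sketch follows; the function name, the 0-based list representation, and the returned comparison counter are our assumptions.

```python
def count_active(segment, separator):
    """Return (h, comparisons), where h is the number of active elements.

    Assumes every active element (< separator) precedes every buffer element
    (>= separator); neither part needs to be sorted.
    """
    lo, hi = 0, len(segment)
    comparisons = 0
    while lo < hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if segment[mid] < separator:   # active: the boundary lies to the right
            lo = mid + 1
        else:                          # buffer: the boundary is at mid or to the left
            hi = mid
    return lo, comparisons
```

For a segment of $s$ locations this loop performs at most $1 + \lfloor \log s \rfloor$ comparisons, matching the bound in the text.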

2.3. Structure of the frame memory. The frame memory, placed at the left end of $B$,
consists of $r^{\#}$ so-called frame blocks, each of length $r$, where

$$ \begin{aligned}
r &= 1 + \lceil \log(2m/s) \rceil < 2 + \log(2m/s) \le \log m, \\
r^{\#} &= 2^{r-1} = 2^{\lceil \log(2m/s) \rceil} < 2 \cdot 2m/s \le 4m/(\log m)^4,
\end{aligned} \tag{4} $$

using (1) and $m \ge 4$. That is, the frame memory is of total length

$$ R = r^{\#} \cdot r < 4m/(\log m)^3. \tag{5} $$

Using (3), (5), and $m \ge 4$, we get that the total space requirements for the segment and
frame memories do not exceed the size of the buffer $B$, since
$R + S < 4m/(\log m)^3 + 2m \le 3m - 1$.
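The layout parameters of equations (1) to (5) are easy to evaluate for a concrete $m$. The Python sketch below is only an illustration: the function name and dictionary keys are ours, floating-point logarithms are used for simplicity, and $m$ is assumed large enough that $s \le m$.

```python
import math

def layout_parameters(m: int) -> dict:
    """Compute s, s#, S, r, r#, R for a block of m elements."""
    log_m = math.log2(m)
    s = math.ceil(log_m ** 4)
    if s % 2 == 0:
        s += 1                                   # equation (1): s must be odd
    s_sharp = (2 * m) // s                       # equation (2)
    S = s_sharp * s                              # equation (3)
    r = 1 + math.ceil(math.log2(2 * m / s))      # equation (4)
    r_sharp = 2 ** (r - 1)
    R = r_sharp * r                              # equation (5)
    return {"s": s, "s_sharp": s_sharp, "S": S,
            "r": r, "r_sharp": r_sharp, "R": R}
```

For $m = 2^{20}$ this gives $s = 160001$, $s^{\#} = 13$, $r = 5$ and $R = 80$, so the frame is tiny compared with the segment memory, and $R + S$ stays well below $3m - 1$.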

A frame block is either free, containing buffer elements only, or it is active, containing 
some active elements followed by some buffer elements. Initially, all frame blocks are free. 
During the computation, active frame blocks are concentrated in a contiguous left part of 
the frame, followed by some free frame blocks in the right part. However, there are some 
important differences from the segment memory structure: 

First, the active elements, forming a left part of a frame block, are in sorted order.
So are the active frame blocks, forming a left part of the frame memory. More precisely,
let $a_1, a_2, \ldots, a_f$ denote the sequence of all active elements stored in the frame
memory, obtained by reading active elements from left to right, ignoring buffer elements and
frame block boundaries. Then $a_1, a_2, \ldots, a_f$ is a sorted sequence of elements.
Consequently, a subsequence of these, stored in the first (leftmost) positions of active frame
blocks, denoted here by $a_{i_1}, a_{i_2}, \ldots, a_{i_g}$, must also be sorted. Here $f$
denotes the total number of active elements in the frame, while $g$ denotes the number of
active frame blocks, at the given moment. Similarly,
$a_{i_j}, a_{i_j+1}, a_{i_j+2}, \ldots, a_{i_{j+1}-1}$, the sequence of active elements stored
in the $j$th frame block, is also sorted.

Second, the number of active elements in an active frame block can range between 1
and $r - 1$. That is, we keep room for potentially storing one more active element in each
active frame block, but we do not care about a sparse distribution of active elements in the
frame. The only restriction follows from the fact that there are no free blocks in between
active blocks.

2.4. Relationship between the frame and segments. Each active element in the frame
memory, i.e., each of the elements $a_1, a_2, \ldots, a_f$, has an associated segment,
$\sigma_1, \sigma_2, \ldots, \sigma_f$ respectively, in the segment memory. The segment
$\sigma_k$, for $k$ ranging between 1 and $f$, contains some active elements $a$ satisfying
$a_k \le a \le a_{k+1}$, taken from $A$ and stored in the structure so far. The active
elements satisfying $a_f \le a$ are stored in $\sigma_f$; similarly, those satisfying
$a \le a_1$ are stored in a special segment $\sigma_0$. Note that the segment $\sigma_0$ has no
"parent" in the sequence $a_1, a_2, \ldots, a_f$, that is, no frame element to be associated
with. Chronologically, $\sigma_0$ is the first active segment that has been allocated. If
$f = 0$, i.e., no active elements have been stored in the frame yet, all active elements are
transported from $A$ to $\sigma_0$.

Note also that (in order to keep the number of active elements in active segments
balanced) we do allow some elements equal to $a_k$ to be stored both in $\sigma_{k-1}$ and in
$\sigma_k$. In general, we may even have $a_k = a_{k+1} = \ldots = a_{k'}$, for some $k < k'$.
Then elements equal to $a_k$ may be found in any of the segments
$\sigma_{k-1}, \sigma_k, \ldots, \sigma_{k'}$. However, the algorithm tries to store each
"new" active element $a$, coming from $A$, in the leftmost segment that can be used at the
moment, i.e., it searches for $k$ satisfying $a_k < a \le a_{k+1}$.

Recall that we also maintain the invariant that each active segment contains at least
$\lfloor s/2 \rfloor$ active elements. Thus, if the frame contains $f$ active elements at the
given moment, namely, $a_1, a_2, \ldots, a_f$, for some $f \ge 1$, the total number of active
elements, stored both in the frame and the segments
$\sigma_0, \sigma_1, \sigma_2, \ldots, \sigma_f$, is at least
$f + (f+1) \cdot \lfloor s/2 \rfloor$. Now, using the fact that $s$ is odd, by (1), we get
that this number is at least
$f + (f+1) \cdot (s/2 - 1/2) = (f+1) \cdot s/2 + (f/2 - 1/2) \ge (f+1) \cdot s/2$. However,
the total number of all active elements is exactly equal to $m$, which gives
$m \ge (f+1) \cdot s/2$, and hence also $f + 1 \le 2m/s$. But $f + 1$, the number of active
segments, is an integer number, which gives that $f + 1 \le \lfloor 2m/s \rfloor$.
Therefore, using (2) and (4),

$$ \begin{aligned}
f + 1 &\le \lfloor 2m/s \rfloor = s^{\#}, \\
f &< \lfloor 2m/s \rfloor \le 2^{\lceil \log(2m/s) \rceil} = r^{\#}.
\end{aligned} \tag{6} $$

(The argument has used the assumption that $f \ge 1$. However, (6) is trivial for $f = 0$,
since $r^{\#} \ge s^{\#} \ge 2$, if $m$ is sufficiently large.)

As a consequence, we get that $f + 1$, the number of active segments, does not exceed
$s^{\#}$, the capacity of the segment memory. Second, $f$, the number of active elements in
the frame, will never exceed $r^{\#}$, the total number of blocks in the frame, and hence
there is enough room to store all active frame elements, even if each active frame block
contained only a single element of the sequence $a_1, a_2, \ldots, a_f$.

2.5. Structure of the pointer memory. The relative order of active frame elements in
the sequence $a_1, a_2, \ldots, a_f$ does not correspond to the chronological order in which
the segments $\sigma_0, \sigma_1, \sigma_2, \ldots, \sigma_f$ are allocated in the segment
memory. Therefore, with each element position in the frame, we associate a pointer to the
starting position of the corresponding segment. More precisely, if the frame is viewed as a
single contiguous zone of elements $x_1 \ldots x_R$ (ignoring boundaries between the frame
blocks), then the corresponding zone of pointers is $\pi_1 \ldots \pi_R$. If, for some $\ell$,
the element $x_\ell$ is a buffer element, then $\pi_\ell = 0$, which represents a NIL pointer.
Conversely, if $x_\ell$ is an active element belonging to the sequence
$a_1, a_2, \ldots, a_f$, then the value of $\pi_\ell$ represents the starting position of the
segment associated with $x_\ell$. (The pointer $\pi_0$ to the segment $\sigma_0$, having no
"parent" in the frame, is stored separately, in a global index variable.)

Since there are at most $s^{\#}$ segments, all of equal length, a pointer to a segment can
be represented by an integer value of at most $s^{\#} = \lfloor 2m/s \rfloor < m/2$,
using (2). Thus, a single pointer can be represented by a block of $p$ bits, where

$$ p = 1 + \lfloor \log s^{\#} \rfloor < \log m. \tag{7} $$

The number of pointers is clearly equal to $R$, the total size of the frame. Therefore,

$$ p^{\#} = R. $$

Thus, the pointer memory can be viewed as a contiguous array consisting of $p^{\#}$ bit
blocks, of $p$ bits each, and hence, by (5) and (7), its total length is at most

$$ P = p^{\#} \cdot p = R \cdot p \le \lfloor 4m/(\log m)^2 \rfloor, \tag{8} $$

using also the fact that $P$ must be an integer number.

Since an in-place algorithm can store only a limited amount of information in index
variables, the pointer memory is actually simulated by two separate contiguous blocks $\Pi_L$
and $\Pi_R$, each containing at least $\lfloor 4m/(\log m)^2 \rfloor$ elements. Initially,
$\Pi_L$ and $\Pi_R$ are sorted, and the largest (rightmost) element in $\Pi_L$ is strictly
smaller than the smallest (leftmost) element in $\Pi_R$. This allows us to encode the value of
the $j$th bit, for any $j$ ranging between 1 and $\lfloor 4m/(\log m)^2 \rfloor$, by swapping
the $j$th element of $\Pi_L$ with the $j$th element of $\Pi_R$. Testing the value of the $j$th
bit is thus equivalent to comparing the relative order of the corresponding elements in
$\Pi_L$ and $\Pi_R$, which costs only a single comparison. Setting a single bit value requires
a single comparison and, optionally, a single swap of two elements, i.e., 3 element moves. The
initial distribution of elements in $\Pi_L$ and $\Pi_R$ represents all
$\lfloor 4m/(\log m)^2 \rfloor$ bits cleared to zero.
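The bit-stealing technique can be sketched as follows in Python. This is a hedged illustration rather than the paper's code: the helper names, 0-based bit indices, and the most-significant-bit-first layout of a pointer are our assumptions. The only property relied on is that every element of `pi_l` is strictly smaller than every element of `pi_r`.

```python
def read_bit(pi_l, pi_r, j):
    """Bit j is 1 exactly when the j-th pair is out of its initial order."""
    return 1 if pi_l[j] > pi_r[j] else 0      # one comparison

def write_bit(pi_l, pi_r, j, bit):
    """Set bit j; the pair is swapped (3 element moves) only if the value changes."""
    if read_bit(pi_l, pi_r, j) != bit:
        pi_l[j], pi_r[j] = pi_r[j], pi_l[j]

def read_pointer(pi_l, pi_r, start, p):
    """Decode the p-bit unsigned integer stored in bits start..start+p-1."""
    value = 0
    for j in range(start, start + p):
        value = 2 * value + read_bit(pi_l, pi_r, j)
    return value

def write_pointer(pi_l, pi_r, start, p, value):
    """Encode value into the p bits starting at bit index start."""
    for offset in range(p):
        bit = (value >> (p - 1 - offset)) & 1
        write_bit(pi_l, pi_r, start + offset, bit)
```

Reading a $p$-bit pointer therefore costs $p$ comparisons and no moves, and clearing all bits back to zero restores both blocks to their original sorted state.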
2.6. Inserting elements in the structure. The procedure sorting the block $A$ works in
two phases. In the first phase, the procedure takes, one after another, all $m$ active elements
from $A$ and inserts them in the structure described above. The procedure also saves some
buffer elements from $B$, and keeps the structure "balanced." In the second phase, all active
elements are transported back to $A$, this time in sorted order.

For each active element $a$ in $A$, we find a segment, among
$\sigma_0, \sigma_1, \sigma_2, \ldots, \sigma_f$, where this element should go.

First, by the use of a binary search with the given element $a$ over
$a_{i_2}, \ldots, a_{i_g}$, that is, over the leftmost locations in the active frame blocks,
find the "proper" frame block for the element $a$, i.e., the index $j$ satisfying
$a_{i_j} < a \le a_{i_{j+1}}$. Note that the element $a_{i_1}$ is excluded from the range of
the binary search. If $a \le a_{i_2}$, the binary search will return $j = 1$, i.e., the first
frame block. Similarly, for $a_{i_g} < a$, the binary search returns $j = g$, i.e., the last
frame block. If $g < 2$, we can go directly to the first (and only) active frame block without
using any binary search, that is, $j := 1$.

Second, by the use of a binary search with the given element $a$ over the $r$ locations in
the $j$th active frame block, find the "proper" active frame element for the element $a$,
i.e., the index $k$ satisfying $a_k < a \le a_{k+1}$. Note that, since
$a_{i_j} < a \le a_{i_{j+1}}$, the elements $a_k$ and $a_{k+1}$ are between $a_{i_j}$ and
$a_{i_{j+1}}$ in the sequence $a_1, a_2, \ldots, a_f$ of all frame elements, not excluding the
possibility that $a_{i_j} = a_k$, and/or $a_{k+1} = a_{i_{j+1}}$. Recall that the $j$th active
frame block begins with the active elements $a_{i_j}, a_{i_j+1}, \ldots, a_{i_{j+1}-1}$,
followed by some buffer elements, to fill up the room, so that the length of the block is
exactly equal to $r$. These buffer elements are not sorted; however, they are all greater than
or equal to $b^{\sim}$, the smallest buffer element. On the other hand, the element $a$, being
active, is strictly smaller than $b^{\sim}$. This allows us to use the binary search with the
given $a$ in the standard way, which returns the index $k$ satisfying $a_k < a \le a_{k+1}$.
For $a_{i_{j+1}-1} < a$, the binary search returns correctly $k = i_{j+1} - 1$. If $j = 1$,
that is, if we are in the first frame block, the binary search may end up with $k = 0$,
indicating that $a \le a_1 = a_{i_1}$.
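The reason a standard binary search is safe on a frame block with an unsorted buffer tail can be seen in a short Python sketch. The function name and the list representation are our assumptions; the returned value is an offset within the block, which would still have to be translated into the global index $k$.

```python
def find_frame_element(block, a):
    """Return how many active elements of block are strictly smaller than a.

    block holds a sorted run of active elements followed by unsorted buffer
    elements. Every buffer element is >= the separator and hence > a, so the
    test block[i] < a is true on a prefix and false afterwards, which is all
    that a binary search needs.
    """
    lo, hi = 0, len(block)
    while lo < hi:
        mid = (lo + hi) // 2
        if block[mid] < a:
            lo = mid + 1
        else:
            hi = mid
    return lo
```

A result of 0 in the first frame block corresponds to the case $k = 0$ in the text, and equal keys send the new element to the leftmost usable position, in line with $a_k < a \le a_{k+1}$.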

Third, let the active frame element $a_k$, satisfying $a_k < a \le a_{k+1}$, be placed in a
position $\ell$ of the frame memory, that is, $a_k = x_\ell$. (For $k = 0$, we take
$\ell := 0$.) Then read the information from $\pi_\ell$ in the pointer memory and compute the
starting position of the segment $\sigma_k$. This segment contains elements ranging between
$a_k$ and $a_{k+1}$. If $k = 0$, i.e., the element $a$ should go to $\sigma_0$, the starting
position of the segment is obtained from a separate global index variable.

Fourth, by the use of a binary search with the buffer separator $b^{\sim}$ over the $s$
locations in the current segment, find the boundary $h$ dividing the segment into two parts,
namely, $c_1 \ldots c_h$, the active elements stored in the segment, and
$b_{h+1} \ldots b_s$, some buffer elements, filling up the room.

Fifth, save the buffer element $b_{h+1}$ aside, to the current location of the hole, and,
after that, store the given element $a$ in the segment. If $h + 1 < s$, we are ready to insert
the next element from $A$. However, if $h + 1 = s$, the current segment cannot absorb any more
elements. Therefore, if the segment has become full, we call a procedure "rebalancing" the
structure before trying to store the next element. This procedure will be described later, in
Sect. 2.9.

The above process is repeated until all $m$ active elements have been inserted in the
structure.

Initially, the procedure allocates the segment $\sigma_0$, and stores the first $s - 1$
active elements directly in $\sigma_0$, without travelling via the frame. The number of moves
for these elements is the same as in the standard case, i.e., two moves per each inserted
element.

Let us now determine the standard cost of inserting a single element. The binary search
looking for a proper frame block inspects a range consisting of $g - 1 < r^{\#}$ elements, and
hence it performs at most $1 + \lfloor \log r^{\#} \rfloor \le \log m$ comparisons, by (4).
The second binary search, looking for a proper active element within the given frame block,
inspects a range of $r$ elements, performing at most
$1 + \lfloor \log r \rfloor \le O(\log\log m)$ comparisons, using (4). Reading the value
encoded in the pointer $\pi_\ell$ requires $p \le \log m$ element comparisons, by (7). The
binary search with $b^{\sim}$ over the $s$ locations in the current segment uses
$1 + \lfloor \log s \rfloor \le O(\log\log m)$ comparisons, by (1). Finally, saving one buffer
element and transporting the element $a$ to the current segment can be performed with 2
element moves. However, these costs do not include rebalancing. Since $m$ elements are
inserted this way, we get:

Lemma 1. If we exclude the costs of rebalancing, inserting $m$ elements in the structure
requires $2m \log m + O(m \log\log m)$ comparisons and $2m$ moves.
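As a check on the constants in Lemma 1, the four per-element comparison costs derived above can be added up explicitly (the labels under the braces are ours):

```latex
\[
\underbrace{\log m}_{\text{frame block}}
+ \underbrace{O(\log\log m)}_{\text{within block}}
+ \underbrace{\log m}_{\text{pointer bits}}
+ \underbrace{O(\log\log m)}_{\text{segment boundary}}
= 2\log m + O(\log\log m).
\]
```

Multiplying by the $m$ insertions gives the $2m \log m + O(m \log\log m)$ comparisons of the lemma, and the $2m$ moves come from the two moves per insertion.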

2.7. Extracting in sorted order - frame level. In the second phase, the active elements
are transported back to $A$, in sorted order. Let $f_m$ denote the maximal value of $f$,
corresponding to the number of active elements in the frame at the moment when the last
active element has been stored in the structure. Thus, the frame memory contains the
sorted sequence of active elements $a_1, a_2, \ldots, a_{f_m}$, intertwined with some buffer
elements, so the total size of the frame is $R$, consisting of elements $x_1 \ldots x_R$.
Then we have active elements in the segments
$\sigma_0, \sigma_1, \sigma_2, \ldots, \sigma_{f_m}$, with $\sigma_k$ containing active
elements $a$ that satisfy $a_k \le a \le a_{k+1}$. Thus, to produce the sorted order of all
active elements, it is sufficient to move, back to $A$, the sequence
$\sigma'_0, a_1, \sigma'_1, a_2, \sigma'_2, \ldots, a_{f_m}, \sigma'_{f_m}$, where
$\sigma'_k$ denotes the block of sorted active elements contained in $\sigma_k$.

The procedure begins with moving the block $\sigma'_0$ to $A$. (We shall return to the
problem of sorting a given segment below, in Sect. 2.8.)

Then, in a loop iterated for $\ell = 1, \ldots, R$, check whether $x_\ell$ is an active
element. This requires only a single comparison, comparing $x_\ell$ with $b^{\sim}$. If
$x_\ell$ is a buffer element, it is skipped, and we can go to the next element in the frame.

If $x_\ell$ is an active element, i.e., $x_\ell = a_k$, for some $k$, the procedure saves
the leftmost buffer element, not moved yet from the output block $A$, in the current location
of the hole and, after that, moves $x_\ell = a_k$ to $A$. (The first free position in $A$,
i.e., the position of the leftmost buffer element, is kept in a separate global index
variable, and incremented each time a new active element is transported back to $A$.) Then we
read the value encoded in the pointer $\pi_\ell$ and compute the starting position of the
segment $\sigma_k$. After that, we move all active elements contained in $\sigma_k$ to $A$,
in sorted order, by the procedure presented in Sect. 2.8.

Before showing how the segment $\sigma_k$ can be sorted, let us derive the computational
costs of the above procedure, not including the cost of sorting $\sigma_k$. Testing whether
$x_\ell$ is an active element, for $\ell = 1, \ldots, R$, requires
$R \le O(m/(\log m)^3)$ comparisons, by (5). Transporting $x_\ell = a_k$ to $A$ requires only
$2 f_m$ element moves in total, since only active elements are moved. This gives
$2 f_m < 2 r^{\#} \le O(m/(\log m)^4)$ element moves, by (6) and (4). Reading the values of
$f_m$ pointers, of length $p$ bits each, can be done with
$f_m \cdot p < r^{\#} \cdot p \le O(m/(\log m)^3)$ comparisons, using (6), (4), and (7).
Summing up, we have:

Lemma 2. If we exclude the costs of sorting the segments, extracting in sorted order
requires $O(m/(\log m)^3)$ comparisons and $O(m/(\log m)^4)$ moves.

2.8. Extracting in sorted order - segment level. Now we can describe the routine
extracting, in sorted order, all active elements contained in the given segment $\sigma_k$.
Let $h_k$ denote the number of active elements in $\sigma_k$. Clearly,
$h_k \le s \le \lceil (\log m)^4 \rceil + 1$, using (1). Initially, the routine determines the
value of $h_k$ by the use of a binary search with $b^{\sim}$ over the $s$ locations of the
segment. This costs $1 + \lfloor \log s \rfloor \le O(\log\log m)$ comparisons.

After that, the routine uses a generalized version of heapsort, which in turn uses a 
modified heap-like structure, with 



root nodes (instead of a single root node), and with internal nodes having t sons (instead 
of two sons) . More precisely, we organize c\ . . . c/j fc , the active elements contained in the 
segment, into the implicit structure with the following properties: 

First, the father of the node $c_e$ is the node $c_{e'}$, where $e' = \lfloor (e-1)/t \rfloor$, provided that
$e' \ge 1$. If $e' < 1$, then $c_e$ is one of the root nodes. This implies that the heap has $t$ roots,
and that the sons of $c_e$ are the nodes $c_{t \cdot e+1}, c_{t \cdot e+2}, \ldots, c_{t \cdot e+t}$. If, for some $e$ and $d < t$, we
have $t \cdot e + d = h_k$, the corresponding node $c_e$ has only $d$ sons, instead of $t$. A leaf is a node
$c_e$ without any sons, that is, with $t \cdot e \ge h_k$.
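The index arithmetic of this multi-root heap can be sketched as follows. This is a minimal illustration, not part of the original text; the function names `father`, `sons`, and `levels` are hypothetical, and positions are 1-based as in the text.

```python
def father(e: int, t: int) -> int:
    """Position e' of the father of c_e; a result below 1 means c_e is a root."""
    return (e - 1) // t


def sons(e: int, t: int, h: int) -> range:
    """Positions of the sons of c_e in a heap holding c_1 .. c_h (may be empty)."""
    first = t * e + 1
    last = min(t * e + t, h)
    return range(first, last + 1)


def levels(h: int, t: int) -> int:
    """Number of nodes on the path from c_h up to its root node."""
    count, e = 1, h
    while father(e, t) >= 1:
        e = father(e, t)
        count += 1
    return count
```

For example, with $t = 3$ and $h_k = 14$ the roots are $c_1, c_2, c_3$, the sons of $c_1$ are $c_4, c_5, c_6$, and $c_4$ has only the two sons $c_{13}, c_{14}$.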

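Before passing further, note that the heap does not have more than five levels, since, by
travelling to a root from $c_{h_k}$, we get

\[
\begin{aligned}
h^{(1)} &= \lfloor (h_k - 1)/t \rfloor < h_k/t, \\
h^{(2)} &= \lfloor (h^{(1)} - 1)/t \rfloor < h_k/t^2, \\
h^{(3)} &= \lfloor (h^{(2)} - 1)/t \rfloor < h_k/t^3, \\
h^{(4)} &= \lfloor (h^{(3)} - 1)/t \rfloor < h_k/t^4, \\
h^{(5)} &= \lfloor (h^{(4)} - 1)/t \rfloor \le h^{(4)}/t - 1/t < h_k/t^5 - 1/t \le s/t^5 - 1/t^5 .
\end{aligned}
\]

If we had $1 \le h^{(5)}$, then $1 < s/t^5 - 1/t^5$, and hence also $t^5 < s - 1$. Now, using $t = \lceil (\log m)^{4/5} \rceil$,
we have $t^5 \ge (\log m)^4$ and, since $t^5$ is an integer, $t^5 \ge \lceil (\log m)^4 \rceil \ge s - 1$, by the bound
$s \le \lceil (\log m)^4 \rceil + 1$. This is a contradiction. Therefore $h^{(5)} < 1$, that is, the path from $c_{h_k}$
reaches a root after at most four steps.
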
The second property of our heap is that, if a node contains an active element, then this 
element is not greater than any of its sons. Note that we do not care about sons of a node 
containing a buffer element. (Initially, there are no buffer elements in the heap. However, 
when some active elements have been extracted, buffer elements will fill up the holes). 

This heap property is established in the standard way: For $e = \lfloor (h_k - 1)/t \rfloor, \ldots, 1$,
establish this property in the positions $e, \ldots, h_k$. This only requires determining whether
$c_e$ is not greater than the smallest of its sons and, if necessary, swapping the smallest son with $c_e$.
Processing a single node this way costs $t$ comparisons and 3 element moves. After that, the
heap property is re-established for the son just swapped in the same way. This may activate
a further walk, up to some leaf.

Taking into account that there are $h^{(1)}$ nodes with paths of lengths 1, 2, 3, or 4 (starting
from the given node and ending in a leaf), $h^{(2)}$ nodes with paths of lengths 2, 3, or 4,
$h^{(3)}$ nodes with paths of lengths 3 or 4, and $h^{(4)}$ nodes with paths of length 4, we get that
building the heap costs $t \cdot \sum_{i=1}^{4} h^{(i)} \le 2 h_k$ comparisons and $3 \cdot \sum_{i=1}^{4} h^{(i)} \le 6 h_k/(\log m)^{4/5}$
moves.

After building the heap, the routine transports, $h_k$ times, the smallest element from
the heap to the output block A. Here the moves are organized as follows. First, save the 
leftmost buffer element, not moved yet from A, in the current location of the hole. Then 
find the smallest element, placed in one of the t roots, and move this element to A. After 
that, find the smallest element among the t sons of this root, and move this element to the 
node corresponding to its father. Iterating this process at most five times, we end up with 
a hole in some leaf. Now, we are done. The hole in the leaf will be filled up by a buffer 
element in the future, as a side effect. (Usually, in the next iteration, extracting the next 
smallest element from the heap). 

Thus, unlike in the standard version of heapsort, the size of the heap does not shrink 
but, rather, some new buffer elements are inserted into the heap structure, filling up the 
leaf holes. These buffer elements are then handled by the extracting routine in the standard 
way, as ordinary active elements. Since these elements may travel down, from the leaf level 
closer to the root level, a node containing a buffer element may have a son containing a 
smaller buffer element. This will do no harm, however, since each buffer element is strictly 
greater than any active element, because of the buffer separator $b^-$. Thus, no buffer element
can be extracted from the heap as the smallest element in the first $h_k$ iterations, after which
the routine terminates.
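The extraction loop can be illustrated by the following sketch. It is a simplified model under stated assumptions, not the authors' implementation: the heap lives in a separate Python list, the output block is modelled by a list `out` that initially holds buffer elements, and the leaf hole is refilled immediately instead of lazily in the next iteration. The names `sift_down` and `extract_sorted` are hypothetical. Every buffer element is assumed to be strictly greater than every active element.

```python
def sift_down(c, e, t, h):
    """Restore the heap property below position e (1-based) by swapping downwards."""
    while True:
        kids = range(t * e + 1, min(t * e + t, h) + 1)
        if not kids:
            return
        s = min(kids, key=c.__getitem__)      # smallest son
        if c[e] <= c[s]:
            return
        c[e], c[s] = c[s], c[e]
        e = s


def extract_sorted(active, buffers, t):
    """Return (sorted active elements, final heap contents made of buffer elements)."""
    h = len(active)
    c = [None] + list(active)                 # c[1..h], position 0 unused
    out = list(buffers[:h])                   # output block, initially buffer elements
    for e in range((h - 1) // t, 0, -1):      # build the heap with t roots
        sift_down(c, e, t, h)
    for i in range(h):
        buf = out[i]                          # buffer element displaced from the output
        hole = min(range(1, min(t, h) + 1), key=c.__getitem__)   # smallest root
        out[i] = c[hole]
        while True:                           # pull the smallest son up into the hole
            kids = range(t * hole + 1, min(t * hole + t, h) + 1)
            if not kids:
                break
            s = min(kids, key=c.__getitem__)
            c[hole] = c[s]
            hole = s
        c[hole] = buf                         # the leaf hole is filled by a buffer element
    return out, c[1:]
```

Because a buffer element only moves up when all of its siblings are buffer elements too, every node holding a buffer element has only buffer elements below it, so the smallest root is an active element as long as any active element remains.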

Deriving the costs of the above routine is straightforward. The routine repeats $h_k$ iterations,
performing each time at most $5(t-1) \le 5(\log m)^{4/5}$ comparisons and 6 moves, since
the heap has at most five levels. This gives $h_k \cdot 5(\log m)^{4/5}$ comparisons and $h_k \cdot 6$ moves.

Now we can sum the costs of sorting the segment $\sigma_k$. Determining the value of $h_k$
costs $O(\log\log m)$ comparisons. Building the heap costs at most $2 h_k$ comparisons and
$6 h_k/(\log m)^{4/5}$ moves. Extracting active elements in sorted order costs $h_k \cdot 5(\log m)^{4/5}$
comparisons and $h_k \cdot 6$ moves. Summing up, we get $h_k \cdot O((\log m)^{4/5})$ comparisons and
$h_k \cdot (6/(\log m)^{4/5} + 6)$ moves.

To obtain the total cost of sorting all segments $\sigma_0, \sigma_1, \sigma_2, \ldots, \sigma_{f_m}$, we use the fact that
$\sum_{k=0}^{f_m} h_k \le m$, since the number of active elements stored in the segments is bounded by
the total number of active elements. Therefore, the sum over all segments results in the
following upper bounds:

Lemma 3. Sorting all segments does not require more than $O(m \cdot (\log m)^{4/5})$ comparisons
or $6m + O(m/(\log m)^{4/5})$ moves.

Alternatively, we could use the heap structure with parameter $t = \lceil \log m \rceil$. This results
in a heap with four levels, instead of five (since $\lceil x \rceil^4 \ge \lceil x^4 \rceil$, for each real $x > 0$). This
reduces the leading factor for the number of moves from $6m$ to $5m$. The price we pay is
increasing the number of comparisons, from $o(m \cdot \log m)$ to $4m \cdot \log m + O(m)$. The detailed
argument is very similar to the proof for $t = \lceil (\log m)^{4/5} \rceil$.

2.9. Rebalancing at the segment level. This procedure is activated by the routine of
Sect. 2.6, inserting a new active element in the structure, when, for some $k$, the segment $\sigma_k$
has become full, having absorbed $s$ active elements.

At the moment of activation, some global index variable is pointing to the starting
position of $\sigma_k$. The procedure also remembers $\ell$, the position of the associated active
element $a_k = x_\ell$ in the frame memory, as well as $j$, the position of the frame block containing
the element $a_k$. We shall call this block the current frame block. (If $\sigma_k = \sigma_0$, i.e., $k = 0$,
there is no associated element in the frame. Then $\ell = 0$, but we still have the current frame
block, namely, $j = 1$). The above indices were computed when the latest active element was
inserted in the structure.

First, by the use of a binary search with the buffer separator $b^-$ over the $r$ locations in
the current frame block, find $\ell'$, the position of the leftmost buffer element in this block.
We shall denote this element by $b^\circ$. Recall that we maintain the invariant that each active
frame block has room for one more active element, and therefore it does contain at least
one buffer element.

Second, find a median in the segment $\sigma_k$, i.e., an element $a^\circ$ of rank $\lfloor s/2 \rfloor + 1$. Without
loss of efficiency, the selection procedure will position $a^\circ$ at the end of $\sigma_k$.

Third, the median $a^\circ$ is inserted in the current frame block, one position to the right of $a_k$.
The active elements lying in between $a_k$ and $b^\circ$, that is, occupying locations $x_{\ell+1} \ldots x_{\ell'-1}$
in the frame memory, are shifted one position to the right. At the same time, $b^\circ$ is saved
from $x_{\ell'}$ to the location released by $a^\circ$ at the end of the segment $\sigma_k$. (As a special case, if
$a_k$ is the rightmost active element in the current frame block, only $b^\circ$ and $a^\circ$ are swapped.
The same holds when $\sigma_0$ is rebalanced for the first time, with $\ell = 0$ and $\ell' = 1$). Since $a^\circ$ has
been picked from $\sigma_k$, it satisfies $a_k \le a^\circ \le a_{k+1}$, and hence the sequence of active elements
stored in the frame memory remains sorted.

Fourth, after shifting the active elements in the locations $x_{\ell+1} \ldots x_{\ell'-1}$ one position to
the right, we have to shift the corresponding pointers $\pi_{\ell+1} \ldots \pi_{\ell'-1}$ as well, so the active
elements remain connected with their segments. To move an integer pointer value from $\pi_e$
to $\pi_{e+1}$, we only have to read the value encoded in $\pi_e$ and, at the same time, clear $\pi_e$, and
then to encode this value in $\pi_{e+1}$. Such transport of a pointer costs $O(p)$ comparisons and
moves.

Fifth, we need to connect a new active element in the frame with a new segment. This
concerns the element $a^\circ$, now placed in $x_{\ell+1}$. Thus, we allocate a new segment $\sigma^\circ$ and encode
its starting position in the pointer $\pi_{\ell+1}$.

Sixth, the full segment $\sigma_k$ is halved, that is, we place some $\lfloor s/2 \rfloor$ active elements greater
than or equal to $a^\circ$ into the left part of $\sigma^\circ$ and collect the remaining $\lfloor s/2 \rfloor$ active elements,
smaller than or equal to $a^\circ$, in the left part of the original segment $\sigma_k$. Since many elements
may be equal to $a^\circ$, we distribute such elements both to $\sigma_k$ and $\sigma^\circ$, so that their active
parts are of equal lengths. This also requires saving $\lfloor s/2 \rfloor$ buffer elements, placed originally
in $\sigma^\circ$, to the locations released in $\sigma_k$. (We shall give more details below, in Sect. 2.10). The
outcome of halving is that the active elements in $\sigma_k$ are split into two segments $\sigma_k$ and $\sigma^\circ$,
satisfying $a_k \le a \le a^\circ$ and $a^\circ \le a \le a_{k+1}$, respectively.

Seventh, if there is still room for storing one more active element in the current frame
block, the structure has been rebalanced. We are done, ready to take the next element
from $A$. However, if this block has become full, because of $a^\circ$, the program control jumps
to a routine rebalancing the frame level, described later, in Sect. 2.11.

Let us now derive the computational costs. The binary search, determining the position
of the leftmost buffer element in the current frame block, inspects a range of $r$ elements,
performing $1 + \lfloor \log r \rfloor \le O(\log\log m)$ comparisons, by the bound on $r$. Finding a median, in a segment
of length $s$, requires only $O(s) \le O((\log m)^4)$ comparisons and $\varepsilon \cdot s \le \varepsilon \cdot (2 + (\log m)^4)$
element moves, where $\varepsilon > 0$ is an arbitrarily small, but fixed, real constant, by a known in-place
selection result and the bound on $s$.
Rearranging the elements $a^\circ$, $b^\circ$, and $x_{\ell+1} \ldots x_{\ell'-1}$ in their locations can be done with at most
$r + 2 \le O(\log m)$ moves. Shifting the pointers $\pi_{\ell+1} \ldots \pi_{\ell'-1}$ one position to the right
costs $O(r \cdot p) \le O((\log m)^2)$ comparisons, by the bounds on $r$ and $p$, together with the same number
of moves. Encoding the starting position of a new segment in the pointer $\pi_{\ell+1}$ requires
$O(p) \le O(\log m)$ element moves. Halving the active elements in $\sigma_k$ into two segments
$\sigma_k$ and $\sigma^\circ$ requires only $O(s) \le O((\log m)^4)$ comparisons and $3/2 \cdot s \le 3/2 \cdot (2 + (\log m)^4)$
moves, using Lem. 5, displayed in Sect. 2.10 below, and the bound on $s$.

By summing the bounds above, we get that a single activation of the procedure rebalancing
a segment performs $O((\log m)^4)$ comparisons and $(3/2 + \varepsilon) \cdot (\log m)^4$ moves. Taking
into account that each activation increases the number of active segments, that we start
with one segment, namely, $\sigma_0$, and that we end up with $f_m + 1$ segments, we see that the
number of activations is bounded by $f_m$. This value is bounded by $f_m \le r_\# \le 2m/(\log m)^4$,
by the parameter bounds given earlier. This gives:

Lemma 4. The total cost of keeping the segment level balanced is $O(m)$ comparisons and
$(3 + \varepsilon) \cdot m$ moves, where $\varepsilon > 0$ is an arbitrarily small, but fixed, real constant.

2.10. Halving a segment. Here we describe a simple procedure for halving, needed in
Sect. 2.9 above. We are given a segment $\sigma_k$ of size $s$, and a median $a^\circ$, that is, an element
of rank $\lfloor s/2 \rfloor + 1$, put aside. We want to place some $\lfloor s/2 \rfloor$ active elements greater than or
equal to $a^\circ$ into the left part of another given segment $\sigma^\circ$, of size $s$ again, and collect the
remaining $\lfloor s/2 \rfloor$ elements smaller than or equal to $a^\circ$ in the left part of $\sigma_k$. The first $\lfloor s/2 \rfloor$
buffer elements of $\sigma^\circ$ must be saved.

In the first phase, with $s - 1$ comparisons and no moves, we count $c'$, the number of
elements strictly smaller than $a^\circ$, in $\sigma_k$. This gives us $c = \lfloor s/2 \rfloor - c'$, the number of elements
equal to $a^\circ$ that should remain in $\sigma_k$. This number will be required in the second phase,
when each element $a$ of $\sigma_k$ is compared with $a^\circ$ twice, using "$a < a^\circ$" and "$a > a^\circ$". The
elements strictly smaller than $a^\circ$ and the first $c$ elements detected to be equal to $a^\circ$ will be
considered "small," while the remaining equal elements and those strictly greater than $a^\circ$
will be "large." Each time an element $a = a^\circ$ is detected, the counter $c$ is decreased
by one, until it gets to zero. From then on, any "new" element $a$ will be considered "small"
if and only if $a < a^\circ$, and "large" otherwise.
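The small/large classification with the counter $c$ can be sketched as below. This is only an illustration of the counting rule, not the in-place procedure: the hypothetical function `classify` receives the $s - 1$ elements of the segment other than the median put aside, and it examines them from left to right, whereas the second phase described next examines them in a different order.

```python
def classify(rest, median):
    """Label each element 'small' or 'large' so that both classes get len(rest) // 2 elements.

    rest   -- the s - 1 elements of the segment, the median itself excluded
    median -- an element of rank floor(s/2) + 1 in the whole segment
    """
    half = len(rest) // 2
    c_prime = sum(1 for a in rest if a < median)   # first phase: count strictly smaller
    c = half - c_prime                             # copies of the median that stay "small"
    labels = []
    for a in rest:                                 # second phase: at most two comparisons each
        if a < median:
            labels.append("small")
        elif a > median:
            labels.append("large")
        elif c > 0:                                # an element equal to the median
            c -= 1
            labels.append("small")
        else:
            labels.append("large")
    return labels
```

For the segment contents 4, 7, 7, 2, 7, 9, 7, 8 with median 7 put aside ($s = 9$), we get $c' = 2$ and $c = 2$, so the first two copies of 7 stay "small" and the other two become "large."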

In the second phase, the configurations of the segments are $\sigma_k = A_1 U B_1 b^\circ$ and $\sigma^\circ = A_2 B_2$,
where $A_1$ and $A_2$ denote, respectively, the active elements of $\sigma_k$ found to be "small"
or "large," collected so far, $B_1$ the buffer elements moved from $\sigma^\circ$ to $\sigma_k$, $B_2$ the elements
of $\sigma^\circ$ not moved yet, $U$ the elements of $\sigma_k$ not examined yet, and $b^\circ$ a single buffer element,
filling up the room. $A_2$ and $B_1$ are of equal length, not exceeding $\lfloor s/2 \rfloor$. Initially, $\sigma_k = U b^\circ$,
$\sigma^\circ = B_2$, with $A_1$, $A_2$, and $B_1$ empty. The procedure also remembers the current position of
the hole. (After the first iteration, the hole is always in the leftmost location of $B_1$).

The second phase proceeds in a loop, as follows. Using at most two comparisons, the
rightmost element $a$ of $U$ is determined to be "small" or "large." If $a$ is large, we save the
leftmost element from $B_2$ in the current location of the hole and fill up the new hole in $B_2$
by $a$. Thus, $A_2$ and $B_1$ have been extended, while $U$ and $B_2$ have been reduced. If $a$ is small,
we scan $U$ from left to right until we find the first element $a'$ that is large. All elements on
the left of $a'$ become a part of $A_1$, without being moved. Since $a$ is a small element placed
on the right of the position $\lfloor s/2 \rfloor$, $a'$ must be found before we reach the position $\lfloor s/2 \rfloor + 1$,
or else we would have more than $\lfloor s/2 \rfloor$ small elements, which is a contradiction. Now we
save the leftmost element from $B_2$ to the hole, fill up the hole in $B_2$ by $a'$, and move $a$ to
the place released by $a'$. Then all necessary boundaries are updated.

This is repeated until we have transported exactly $\lfloor s/2 \rfloor$ active large elements from $\sigma_k$
to $\sigma^\circ$. As a consequence, the remaining $\lfloor s/2 \rfloor$ active elements of $\sigma_k$, placed on the left of $B_1$,
must be all small, since the rank of $a^\circ$ is $\lfloor s/2 \rfloor + 1$ and $s$ is odd, by the definition of $s$.

Clearly, we have used at most $3s$ comparisons in total, and at most three moves for each
large element moved from $\sigma_k$ to $\sigma^\circ$. This gives:

Lemma 5. Given a median $a^\circ$, a segment of size $s$ can be halved with at most $O(s)$ comparisons
and $3/2 \cdot s$ moves.

2.11. Rebalancing at the frame level. This routine is activated by the procedure of
Sect. 2.9, rebalancing a segment, when it finds out that, for some $j$, the $j$th frame block has
become full, having absorbed $r$ active elements. As a side effect, the routine may increase
the number of active blocks in the frame. The routine is based on a new variant of a
well-known data structure used to maintain a set of elements in sorted order
in a contiguous zone of memory.

For the purpose of keeping the frame memory balanced, the frame consisting of $r_\#$ frame
blocks is viewed, implicitly, as a complete binary tree with $r_\# = 2^{r-1}$ leaves, and hence of
(edge) height $r - 1$. We introduce the following numbering of levels: $i = 0$ for the leaves, 1 for
their fathers, and so on, ending by $i = r - 1$ for the root. Each node of the tree is associated
with a contiguous subarray of the frame blocks, and with a path leading to this node from
the root, as follows.

The $j$th leaf, for any $j$ ranging between 1 and $2^{r-1}$, is associated with the $j$th frame block,
i.e., with a subarray consisting of $1 = 2^0$ frame blocks, starting from the block position $j$.
The corresponding path from the root to this leaf is represented by the number $\tilde{j} = j - 1$. It
is easy to see that by reading the binary representation of $\tilde{j}$ from left to right (with leading
zeros so that its length is $r - 1$) we get the branching sequence along this path; 0 is interpreted
as branching to the left, while 1 as branching to the right.

Given a node $v$ at a level $i$, associated with a path number $\tilde{j}$ and with a subarray of
length $2^i$ blocks, starting from a block position $j$, the father $v'$ of this node is associated
with the path number $\tilde{j}' = \lfloor \tilde{j}/2 \rfloor$, and with the subarray of length $2^{i+1}$, starting from the
block position $j' = j$, if $\tilde{j}$ is even ($v$ is a left son of $v'$), but from $j' = j - 2^i$, if $\tilde{j}$ is odd (right
son). Thus, the subarray for the father is obtained by concatenation of the two subarrays
for its sons, while its path number is obtained by cutting off the last bit in the path number
for any of its sons.
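The climb from a leaf towards the root can be sketched as follows. This is a minimal illustration of the two formulas above; the generator name `ancestors` and the tuple layout are hypothetical, and block positions are 1-based as in the text.

```python
def ancestors(j: int, r: int):
    """Yield (level, path number, first block, number of blocks) for the j-th leaf
    and all of its ancestors, in a tree with 2**(r-1) leaves."""
    path, start = j - 1, j
    for level in range(r):                 # levels 0 (leaf) .. r-1 (root)
        yield level, path, start, 2 ** level
        if path % 2 == 1:                  # right son: the father's subarray starts
            start -= 2 ** level            # 2**level blocks earlier
        path //= 2                         # cut off the last bit of the path number
```

For $r = 4$ (eight frame blocks) and the leaf $j = 6$, the path number is 5 (binary 101), and the associated subarrays are the blocks 6, then 5 to 6, then 5 to 8, and finally 1 to 8 for the root.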

During the computation, the number of active elements in some local area of the frame
may become too large. The purpose of rebalancing a subarray, associated with a node $v$
at a level $i$, for $i > 0$, is to eliminate such local densities and redistribute active elements
more evenly. More precisely, after rebalancing the subarray, the following two conditions
will hold:

(i) The number of active elements, in any frame block belonging to the subarray associated
with the given node $v$ at the level $i$, will not exceed the threshold $\tau_i = r - i$.

(ii) The frame memory will not contain any free blocks (without active elements) in
between some active blocks.

Note that, if a node $v$ at a level $i > 0$ is an ancestor of the $j$th leaf, the condition (i)
ensures that the $j$th frame block is not full any longer. Neither is any other block within the
subarray. Such redistribution of active elements is possible only if $\alpha(v)$, the total number of
active elements in the subarray associated with $v$, is bounded by $\alpha(v) \le \tau_i \cdot 2^i$. We say that
the node $v$ overflows, if $\alpha(v) > \tau_i \cdot 2^i$.

The condition (ii) is required only because of the procedure presented in Sect. 2.6,
transporting active elements from the block $A$ to the structure. Recall that this procedure
uses a binary search over the leftmost locations in the active frame blocks, and hence these
blocks must form a contiguous zone.

Now we can describe the routine rebalancing the frame. 

First, starting from the father of the frame block that is full, climb up and find the lowest
ancestor $v$ that does not overflow, with $\alpha(v) \le \tau_i \cdot 2^i$. The formulas for $\tilde{j}$ and $j$, presented
above, give us a simple tool for computing the boundaries of the associated subarrays, along
the path climbing towards the root. To compute the value of $\alpha(v)$, for the given ancestor $v$
at the given level $i$, simply scan all $2^i$ frame blocks forming the associated subarray and sum
up the numbers of active elements in these blocks, using a binary search with the buffer
separator $b^-$ over the $r$ locations in each block.

Second, move the $\alpha(v)$ active elements in the associated subarray of $v$ to the last $\alpha(v)$
locations. That is, processing all $2^i \cdot r$ locations in the subarray from right to left, collect
all elements smaller than $b^-$ to the right end. Before moving an active element from $x_e$
to $x_{e'}$, for some $e < e'$, the buffer element in the target position $x_{e'}$ is saved to the current
location of the hole. Then move the associated pointer in the corresponding positions of the
pointer memory, from $\pi_e$ to $\pi_{e'}$, by reading and clearing the bit value encoded in $\pi_e$ and
encoding this value in $\pi_{e'}$.

Third, redistribute the $\alpha(v)$ active elements back, this time more evenly in the $2^i$ frame
blocks of the subarray, moving also the pointers in the corresponding positions, as follows.
Let $\alpha_D = \lfloor \alpha(v)/2^i \rfloor$ and $\alpha_M = \alpha(v) \bmod 2^i$. Then put $\alpha_D + 1$ active elements in each of the
first $\alpha_M$ blocks, and $\alpha_D$ active elements in each of the remaining $2^i - \alpha_M$ blocks. In each
block, the active elements are concentrated in its left part.

Fourth, as a side effect of redistribution, the size of the active part in the frame memory
may have been increased. This requires updating the value of $g$, the number of active frame
blocks, kept in a separate global index variable. Let $g'$ be the block position of the rightmost
frame block in the subarray of $v$. Then let $g := \max\{g, g'\}$.
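The first and third steps can be modelled on block occupancy counts alone, as in the sketch below. It is a simplified model under stated assumptions: `counts[b - 1]` is the number of active elements in frame block `b`, no elements or pointers are actually moved, the variable $g$ is not maintained, and the function name `rebalance` is hypothetical.

```python
def rebalance(counts, j, r):
    """Rebalance the occupancy counts after frame block j (1-based) has become full.

    Returns (level, first block) of the lowest ancestor that does not overflow.
    """
    path, start = j - 1, j
    for level in range(1, r):                        # climb from the father of the leaf
        if path % 2 == 1:                            # we were a right son
            start -= 2 ** (level - 1)
        path //= 2
        width = 2 ** level                           # blocks in the subarray of v
        total = sum(counts[start - 1:start - 1 + width])      # alpha(v)
        if total <= (r - level) * width:             # v does not overflow
            d, m = divmod(total, width)              # alpha_D and alpha_M
            counts[start - 1:start - 1 + width] = [d + 1] * m + [d] * (width - m)
            return level, start
    raise ValueError("even the root overflows: too many active elements in the frame")
```

With $r = 4$ and counts 4, 3, 1, 0, 0, 0, 0, 0, the father of block 1 overflows (7 > 6), but its grandfather does not (8 ≤ 8), so the first four blocks end up with two active elements each.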

It should be pointed out that, for each leaf, the desired ancestor $v$ without overflow does
exist. Using the parameter bounds given earlier and $\tau_i = r - i$, for the level $i = r - 1$, that is, for
$v$ being the root node, we get $\alpha(v) \le f_m \le r_\# = 1 \cdot 2^{r-1} = \tau_{r-1} \cdot 2^{r-1}$, and hence at least the
root node does not overflow. Therefore, in the first step, the loop climbing up towards the root
must halt correctly.

Further, the redistribution of active elements, presented in the third step, is correct,
since $(\alpha_D + 1) \cdot \alpha_M + \alpha_D \cdot (2^i - \alpha_M) = \alpha(v)$. It is easy to see that the redistribution satisfies the
condition (i) above, using the fact that the node $v$ does not overflow, and hence $\alpha(v) \le \tau_i \cdot 2^i$.
There are two cases to consider: For $\alpha(v) \le \tau_i \cdot 2^i - 1$, we have
$\alpha_D + 1 \le \lfloor (\tau_i \cdot 2^i - 1)/2^i \rfloor + 1 \le (\tau_i - 1) + 1 = \tau_i$, since $\tau_i$ is an integer.
If $\alpha(v) = \tau_i \cdot 2^i$, we get $\alpha_M = 0$, and hence all $2^i$ blocks are
"remaining," with only $\alpha_D$ active elements in each. But here $\alpha_D = \lfloor (\tau_i \cdot 2^i)/2^i \rfloor = \lfloor \tau_i \rfloor = \tau_i$.

It is also easy to see that the redistribution satisfies the condition (ii). Since $v$ is the
lowest ancestor that does not overflow, along the path from the full frame block towards the
root, it must have a son that does overflow, with at least $\tau_{i-1} \cdot 2^{i-1}$ active elements in its
subarray. (As a special case, for $i = 1$, we get $\tau_0 \cdot 2^0 = (r - 0) \cdot 1 = r$ active elements in the
$j$th frame block that is full). The subarray of the son is a part of the subarray associated
with $v$, and hence $\alpha(v) \ge \tau_{i-1} \cdot 2^{i-1} \ge 2 \cdot 2^{i-1}$, using the fact that $i - 1 \le r - 2$. But then
$\alpha_D = \lfloor \alpha(v)/2^i \rfloor \ge 1$. This implies that each frame block in the subarray associated with $v$
contains at least one active element after redistribution, and hence the zone of active frame
blocks will remain contiguous.

Consider now the cost of a single activation of the above routine, rebalancing a subarray
for a node $v$ at a level $i$. Looking for the lowest ancestor without overflow requires counting
the numbers of active elements in the associated subarrays along a path climbing up from
a father of a leaf, for levels $e = 1, \ldots, i$. In the $e$th level, $2^e$ blocks are examined, by a
binary search over the $r$ locations of the block. By the bound on $r$, this gives
$\sum_{e=1}^{i} 2^e \cdot (1 + \lfloor \log r \rfloor) \le 2^i \cdot O(\log\log m)$ comparisons. The cost of the second step, collecting $\alpha(v)$ active elements
to the right end, is $2^i \cdot r$ comparisons (one comparison with $b^-$ for each location in the
subarray), plus $\alpha(v) \cdot 2 + 1$ moves (two moves for each collected element). However, with
each collected element, the corresponding pointer must also be transported, which gives
additional $\alpha(v) \cdot O(p)$ comparisons and moves. Using $\alpha(v) \le \tau_i \cdot 2^i \le r \cdot 2^i$, together with
the bounds on $r$ and $p$, the cost of the second step can be bounded by $2^i \cdot O(r \cdot p) \le 2^i \cdot O((\log m)^2)$
comparisons and moves. The same computational resources are sufficient in the third step,
redistributing the same number of active elements back, but more evenly, together with
their pointers. Again, this gives $\alpha(v) \cdot O(p)$ comparisons and moves, which can be bounded
by $2^i \cdot O((\log m)^2)$. Finally, the fourth step does not require any element comparisons or
moves, it just updates one index variable, in $O(1)$ time.

Summing up, the cost of a single activation is $2^i \cdot O((\log m)^2)$ comparisons and moves,
for each node $v$ at the fixed level $i > 0$. To get the total cost, we must take into account how
frequently such rebalancing is activated.

When a rebalancing is activated, $v$ must have a son with at least $\tau_{i-1} \cdot 2^{i-1}$ active elements,
since $v$ is the lowest ancestor that does not overflow, along some path climbing up. Now,
trace back the history of computation, to the moment when the entire subarray associated
with $v$ was a subject of redistribution for the last time. This way we get a node $v'$, either
an ancestor of $v$ or $v$ itself, at a level $i' \ge i$, with the associated subarray containing the
entire subarray for $v$. After the redistribution for $v'$, both sons of $v$ contained at most
$\tau_{i'} \cdot 2^{i-1} \le \tau_i \cdot 2^{i-1}$ active elements. Thus, in the meantime, the number of active elements
in one of the sons of $v$ has been increased by at least $\tau_{i-1} \cdot 2^{i-1} - \tau_i \cdot 2^{i-1} = 2^{i-1}$. Since
other redistributions, taking place between the moments of rebalancing $v'$ and $v$, could not
"import" any active elements to the subarray of $v$ from any other parts of the frame, the $2^{i-1}$
additional active elements must have been inserted here. (See the procedure of Sect. 2.9,
third step). Thus, there have to be at least $2^{i-1}$ insertions in the associated subarray
between any two redistributions for $v$. Note that, for the fixed level $i$, subarrays associated
with different nodes $v$ do not overlap. Thus, we can charge the cost of each activation, for the
given node $v$, to the $2^{i-1}$ insertions preceding this activation in the given subarray, without
charging the same insertion more than once. This gives $2^i \cdot O((\log m)^2)/2^{i-1} \le O((\log m)^2)$
comparisons and moves, per single insertion of an active element in the frame memory.
Since, in the whole computation, there were only $f_m \le r_\# \le O(m/(\log m)^4)$ insertions, by
the parameter bounds given earlier, we get the cost $O(m/(\log m)^2)$ comparisons and moves, for rebalancing of all
nodes at the fixed level $i$. By summing over all levels, using $i \le r - 1 \le \log m$, we
get the total cost:

Lemma 6. The total cost of keeping the frame memory balanced is $O(m/\log m)$ comparisons,
together with the same number of moves.

2.12. Summary. By summing the bounds presented in the lemmas above, we get:

Theorem 7. The cost of sorting the given block $A$ of size $m$ is $2m \cdot \log m + O(m \cdot (\log m)^{4/5})$
comparisons and $(11 + \varepsilon) \cdot m$ moves, where $\varepsilon > 0$ is an arbitrarily small, but fixed, real constant,
provided we can use additional buffer and pointer memories, of respective sizes $3m - 1$ and
$\lfloor 4m/(\log m)^2 \rfloor$.

The algorithm presented above assumes that $m$ is "sufficiently large," so that $s$, as defined
earlier, satisfies $s < m$. This presupposition holds for each $m > 2^{16} = 65536$. Shorter blocks
are handled in a different way, by the procedure described later, in Sect. 3.3. The bounds
presented by Thm. 7 for the number of comparisons and moves will remain valid.

3. In-Place Sorting 

Now we can present an in-place algorithm sorting the given array $A$ consisting of $n$ elements.
If $n \le 2^{16}$, the array is sorted directly, by the procedure of Sect. 3.3, described later. In the
general case, for $n > 2^{16}$, the task of the main program is to provide sufficiently large pointer
and buffer memories for the procedure presented in Sect. 2.

3.1. Building a pointer memory. The size of the largest block ever sorted by the procedure
of Sect. 2 will not exceed $m = n/4$. Using the pointer memory size from Thm. 7 and the fact that
the function $4x/(\log x)^2$ is monotone increasing for $x \ge 8$, we see that the size of the pointer
memory can be bounded by $P = \lfloor 4(n/4)/(\log(n/4))^2 \rfloor = \lfloor n/(\log(n/4))^2 \rfloor$. This will suffice for all sorted blocks.
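As a quick numeric illustration of this bound, the following sketch evaluates $P$ and compares it with the number of pointer bits needed for smaller blocks. The function names are hypothetical, and floating-point logarithms are used, which is adequate for an illustration but is an assumption of this sketch rather than part of the algorithm.

```python
import math


def required_bits(m: int) -> int:
    """Pointer memory needed for a block of size m: floor(4m / (log2 m)^2)."""
    return math.floor(4 * m / math.log2(m) ** 2)


def pointer_memory_size(n: int) -> int:
    """P = floor(n / (log2(n/4))^2), the bound for the largest block size m = n/4."""
    return math.floor(n / math.log2(n / 4) ** 2)
```

For $n = 2^{20}$ we get $\log(n/4) = 18$, and hence $P = \lfloor 1048576/324 \rfloor = 3236$ bits.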

The pointer memory is built by collecting two contiguous blocks $\Pi_L$ and $\Pi_R$. The block
$\Pi_L$, placed at the left end of $A$, will contain the smallest $P$ elements of the array $A$, while
$\Pi_R$, placed at the right end, the largest $P$ elements.

The block $\Pi_R$ is created first, by the use of the heapsort with $t$ root nodes and internal
nodes having $t$ sons. The detailed topology of edges connecting nodes in this kind of heap
has been presented in Sect. 2.8, devoted to extracting sorted elements at the segment level.

However, there are some substantial differences from the generalized heapsort of Sect. 2.8.
This time the branching degree is $t = \lceil \log n \rceil$. Therefore, the heap has $q \le 1 + \lfloor \log_t n \rfloor \le
O(\log n/\log\log n)$ levels. Here we keep large elements at the root level, instead of small
elements. That is, no node contains an element smaller than any of its sons. Unlike in
Sect. 2.8, no buffer elements are used here to fill up the holes; the heap structure shrinks in
the standard way, when the largest element is extracted.

The initial building of the heap structure is standard, and agrees with the heap building
in Sect. 2.8. It is easy to see that, for a heap with $n$ elements, branching degree equal to $t$, and
$q$ levels, the cost of the heap initialization can be bounded by $t \cdot \sum_{i=1}^{q} n/t^i \le n \cdot t/(t-1) \le O(n)$
comparisons and $3 \cdot \sum_{i=1}^{q} n/t^i \le n \cdot 3/(t-1) \le O(n/\log n)$ moves, using $t \ge \log n$.

After building the heap, the routine extracts, $P$ times, the largest element from the
heap in the standard way. That is, when the largest element is extracted, it replaces the
element in the rightmost leaf, which in turn is inserted into the "proper" position along the
so-called special path, starting from the position of the largest root (just being extracted)
and branching always to the largest son.

The costs of the above routine are straightforward. The trajectory of the special path can
be localized with $q \cdot (t-1)$ comparisons, and the new position for the element in the rightmost
leaf can be found by a binary search along this trajectory with $1 + \lfloor \log q \rfloor$ comparisons.
Summing up, an extraction of the largest element can be done with $q \cdot (t-1) + (1 + \lfloor \log q \rfloor)$
comparisons, together with $q + 2$ moves. Using $t \le O(\log n)$ and $q \le O(\log n/\log\log n)$,
we get, per single extraction, at most $O((\log n)^2/\log\log n)$ comparisons, together with
$O(\log n/\log\log n)$ moves.

If we let the above procedure run till the end, it would sort the entire array $A$ in time
$O(n \cdot (\log n)^2/\log\log n)$. However, the execution is aborted as soon as the largest $P$ elements
are collected. Since $P \le O(n/(\log n)^2)$, the cost of building the heap becomes dominant,
and hence the block $\Pi_R$ is created with $O(n)$ comparisons and $O(n/\log n)$ moves.

After $\Pi_R$, the block $\Pi_L$ is created in the same way, with the same computational needs of
comparisons and moves. Instead of large elements, here we collect the smallest $P$ elements.
In addition, since $\Pi_L$ should be created at the left end of $A$, all indices are manipulated in
a mirrorlike way, seeing the first position to the left of $\Pi_R$ as the beginning of the array.

Lemma 8. Building the pointer memory requires $O(n)$ comparisons and $O(n/\log n)$ moves.

Now the configuration of the array $A$ has changed to $\Pi_L A' \Pi_R$, where $A'$ denotes the
remaining elements, to be sorted. Before proceeding further, the algorithm verifies, with a
single comparison, whether the largest (rightmost) element in $\Pi_L$ is strictly smaller than
the smallest (leftmost) element in $\Pi_R$.

If this is not the case, all elements in $A'$ must be equal to these two elements. Therefore,
the algorithm terminates, since the entire array $A$ has already been sorted.

Conversely, if $\Pi_L$ and $\Pi_R$ pass the test above, they can be used to imitate a pointer
memory consisting of $P$ bits.
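One common way to imitate such a bit memory is sketched below. This excerpt does not spell out the encoding, so the pairing used here is an assumption of this sketch: bit $i$ is represented by the $i$th element of $\Pi_L$ together with the $i$th element of $\Pi_R$, and the bit is 1 exactly when the pair is swapped. The class name `StolenBits` and its methods are hypothetical. The sketch relies on every element of $\Pi_L$ being strictly smaller than every element of $\Pi_R$, which is what the test above verifies.

```python
class StolenBits:
    """P bits encoded by the relative order of paired elements of an array a.

    The first P positions of a play the role of Pi_L, the last P positions of Pi_R.
    """

    def __init__(self, a, P):
        self.a, self.P, self.n = a, P, len(a)

    def read(self, i):
        """Read bit i with a single element comparison."""
        return 1 if self.a[i] > self.a[self.n - self.P + i] else 0

    def write(self, i, bit):
        """Set bit i, swapping the pair only if the stored value differs."""
        if self.read(i) != bit:
            j = self.n - self.P + i
            self.a[i], self.a[j] = self.a[j], self.a[i]

    def clear(self):
        """Reset all bits to zero, which restores the original order of both blocks."""
        for i in range(self.P):
            self.write(i, 0)
```

Clearing all bits puts every element back into its own block and position, which matches the requirement that the sorted order in $\Pi_L$ and $\Pi_R$ can be restored after each use.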

3.2. Partition-based sorting. When the blocks $\Pi_L$ and $\Pi_R$ have been created, the zone
$A'$ is kept in the form $A_S A_V$, where $A_S$ and $A_V$ represent the sorted and unsorted parts of $A'$,
respectively. Each element in $A_S$ is strictly smaller than the smallest element of $A_V$. The
routine described here is a partition-based loop. In the course of the $i$th iteration, the length
of $A_V$ is $n_i$, with $n_i < n_{i-1}$. Initially, for $i = 0$, $A_S$ is empty, $A_V = A'$, and $n_0 = n - 2P < n$.
The loop proceeds as follows.

First, find $b^-$, an element of rank $\lceil n_i/4 \rceil$ in $A_V$. The selection procedure places this
element at the right end of $A_V$, so the configuration of $A'$ changes to $A_S A'_V b^-$. Here $A'_V$
denotes a mix of elements in $A_V$, of length $n_i - 1$.

Second, $A'_V$ is partitioned into two blocks $A_<$ and $B_\ge$ consisting, respectively, of elements
strictly smaller than $b^-$ and of those greater than or equal to $b^-$. The configuration of the
array thus changes to $A_S A_< B_\ge b^-$. The respective lengths of $A_<$ and $B_\ge$ will be denoted
here by $n_{i,<}$ and $n_{i,\ge}$. Note that, even for a large block $A_V$, we may obtain a very short
block $A_<$, since many elements may be equal to $b^-$. In fact, the block $A_<$ may even be
empty, of length $n_{i,<} = 0$.

Third, sort the block $A_<$ by the procedure described in Sect. 2, using some initial segments
of $\Pi_L$ and $\Pi_R$ as a pointer memory and of $B_\ge$ as a buffer memory, with $b^-$ as a
buffer separator. This is possible, since $b^-$ has been selected as an element of rank $\lceil n_i/4 \rceil$,
and hence $n_{i,<} \le \lceil n_i/4 \rceil - 1 < n_i/4$, with $n_{i,<} + n_{i,\ge} + 1 = n_i$. But the required size of
buffer is only $3 n_{i,<} - 1 < 3/4 \cdot n_i - 1 = n_i - 1 - n_i/4 < n_i - 1 - n_{i,<} = n_{i,\ge}$. Therefore, the
block $B_\ge$ of length $n_{i,\ge}$ is sufficiently long. Similarly, the required number of bits for pointers
is $\lfloor 4 n_{i,<}/(\log n_{i,<})^2 \rfloor \le \lfloor 4(n/4)/(\log(n/4))^2 \rfloor = P$, and hence the pointer memory is also
sufficiently large. (If $n_{i,<} \le 2^{16}$, $A_<$ is sorted as a short block).

Fourth, restore the sorted order in $\Pi_L$ and $\Pi_R$, by clearing all bits of the pointer
memory to zero. Among other reasons, this is required because the procedure of Sect. 2 will also
be used in subsequent iterations, where it assumes that all bits are initially cleared.

Fifth, after sorting $A_<$, the configuration of $A'$ is $A_S A_{<,S} B'_\ge b_=$, where
$A_{<,S}$ denotes the sorted version of the block $A_<$ and $B'_\ge$ a mixed-up version of
$B_\ge$. Now put the first element in $B'_\ge$ aside and move $b_=$ to the first position after
$A_{<,S}$. After that, collect all elements smaller than or equal to $b_=$ to the left part of
$B'_\ge$, processing also the element put aside. Since $B_\ge$ did not contain elements strictly
smaller than $b_=$, this actually partitions $B'_\ge$ into two blocks $A_=$ and $B_>$ consisting,
respectively, of elements equal to $b_=$ and of those strictly greater than $b_=$, of respective
lengths $n_{i,=}$ and $n_{i,>}$. Clearly, $n_{i,=} + n_{i,>} = n_{i,\ge}$. The configuration has
changed to $A_S A_{<,S} b_= A_= B_>$.

Sixth, observe that $A_S A_{<,S} b_= A_=$ and $B_>$ can be viewed as "new" variants of blocks
$A_S$ and $A_V$. Thus, we can start a new iteration, with $B_>$ as a new block $A_V$, of length
$n_{i+1} = n_{i,>}$. The above process is iterated until the length of the unsorted part drops to
$2^{16}$, or below. This residue is then sorted as a short block, without using a buffer or
pointers, as described later in Sect. 3.3.
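
The control flow of the six steps can be sketched as follows. This is a simplified illustration and not the in-place procedure itself: selection, partitioning, and the inner sort are replaced by Python built-ins working on fresh lists, no pointer memory is used, and the function name `partition_based_sort` and the parameter `short` (the short-block threshold, $2^{16}$ in the text) are hypothetical. The assertions check the size relations claimed in the third step.

```python
def partition_based_sort(block, short=2**16):
    """Follow the six steps above; returns (sorted list, [n_0, n_1, ...]).

    Illustrative only: built-ins stand in for the in-place selection,
    partitioning, and buffered sorting procedures of the paper.
    """
    sorted_part = []             # plays the role of A_S
    unsorted = list(block)       # plays the role of A_V
    lengths = []                 # lengths n_i of A_V, one per iteration
    while len(unsorted) > short:
        n_i = len(unsorted)
        lengths.append(n_i)
        rank = -(-n_i // 4)                               # ceil(n_i / 4)
        pivot = sorted(unsorted)[rank - 1]                # stand-in for linear-time selection
        smaller = [x for x in unsorted if x < pivot]      # A_<
        equal = [x for x in unsorted if x == pivot]       # the pivot together with A_=
        greater = [x for x in unsorted if x > pivot]      # B_>
        assert len(smaller) <= rank - 1                   # n_{i,<} <= ceil(n_i/4) - 1
        assert len(greater) <= n_i - rank                 # n_{i,>} <= n_i - ceil(n_i/4)
        n_ge = len(equal) - 1 + len(greater)              # length of the buffer block
        assert 3 * len(smaller) - 1 < n_ge                # the buffer is long enough
        sorted_part += sorted(smaller) + equal            # A_S A_{<,S} pivot A_=
        unsorted = greater                                # B_> becomes the new A_V
    sorted_part += sorted(unsorted)                       # residual short block
    return sorted_part, lengths
```

Calling `partition_based_sort(data, short=16)` on a small input makes the loop run several iterations; the returned `lengths` then shrink by a factor of at least 3/4 per iteration, which is the property used in the cost analysis.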

Now we can derive computational costs. First, recall that $b_=$ has been selected as an
element of rank $\lceil n_i/4 \rceil$, and hence
$n_{i+1} = n_{i,>} \le n_i - \lceil n_i/4 \rceil \le \frac{3}{4} n_i$. Taking into account
that $n_0 < n$, we get $n_i < (3/4)^i \cdot n$, for each $i \ge 0$. This gives that

\[
  \sum_{i=0}^{\lambda-1} n_i < 4n, \qquad \lambda \le O(\log n), \tag{9}
\]

where $\lambda$ denotes the number of iterations. Second, it is easy to see that

\[
  \sum_{i=0}^{\lambda-1} (n_{i,<} + 1 + n_{i,=}) + n_\lambda \le n, \tag{10}
\]

since, in different iterations, the final locations occupied by $A_{<,S}$, $b_=$, and $A_=$ do
not overlap. Here $n_\lambda$ denotes the length of the residual short block.
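
For completeness, the geometric-series step behind (9) can be written out as below. This is a worked restatement under the assumptions already made in the text (each iteration shrinks the unsorted part by a factor of at least 3/4, and an iteration is executed only while the unsorted part is longer than $2^{16}$); the constant in the bound on the number of iterations is not optimized, and at least one iteration is assumed.

```latex
\begin{align*}
\sum_{i=0}^{\lambda-1} n_i
  &< \sum_{i=0}^{\lambda-1} \left(\tfrac{3}{4}\right)^{i} n
   < n \sum_{i=0}^{\infty} \left(\tfrac{3}{4}\right)^{i}
   = \frac{n}{1 - 3/4} = 4n, \\
2^{16} < n_{\lambda-1} < \left(\tfrac{3}{4}\right)^{\lambda-1} n
  \;&\Longrightarrow\;
  \lambda < 1 + \log_{4/3}\!\left(\frac{n}{2^{16}}\right) \le O(\log n).
\end{align*}
```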

Let us now present the costs for the $i$th iteration. Selection of $b_=$, an element of the given
rank in a block of length $n_i$, costs $O(n_i)$ comparisons and $\varepsilon \cdot n_i$ moves, by
[3]. Partitioning of $A'_V$ into blocks $A_<$ and $B_\ge$ can be done with $n_i - 1$ comparisons
and $2n_{i,<} + 1$ moves, since the length of $A'_V$ is $n_i - 1$, and the number of collected
elements, strictly smaller than $b_=$, is $n_{i,<}$. The cost of sorting the block $A_<$ is
bounded by
$2n_{i,<} \log n_{i,<} + O(n_{i,<} (\log n_{i,<})^{4/5}) \le 2n_{i,<} \log n + O(n_{i,<} (\log n)^{4/5})$
comparisons and $(11+\varepsilon) \cdot n_{i,<}$ moves, by Thm. 7. Sorting of the block $A_<$ is
followed by restoring the sorted order in $\Pi_L$ and $\Pi_R$, by clearing all bits, which costs
$O(P) \le O(n/(\log n)^2)$ comparisons, together with the same number of moves. Positioning $b_=$
to the right of $A_{<,S}$ requires only 2 element moves. Finally, the $i$th iteration is
concluded by partitioning $B'_\ge$ into blocks $A_=$ and $B_>$, with at most
$n_{i,\ge} \le n_i$ comparisons and $2n_{i,=} + 1$ moves, since the length of $B'_\ge$ is
$n_{i,\ge}$, and the number of collected elements, equal to $b_=$, is $n_{i,=}$. The cost of
sorting the residual short block does not exceed the bounds for the standard case:
$2n_\lambda \log n_\lambda + 6.25 n_\lambda \le 2n_\lambda \log n + O(n_\lambda (\log n)^{4/5})$
comparisons and $9.75 n_\lambda \le (11+\varepsilon) \cdot n_\lambda$ moves. (See Sect. 3.3
below.)

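Now we can sum the above costs over all iterations, using (9) and (10). For the number
of comparisons, this gives

\begin{align*}
C(n) &\le \sum_{i=0}^{\lambda-1} n_i \cdot O(1)
        + \sum_{i=0}^{\lambda-1} n_{i,<} \cdot \bigl(2\log n + O((\log n)^{4/5})\bigr)
        + \sum_{i=0}^{\lambda-1} O\bigl(n/(\log n)^2\bigr) \\
     &\quad + n_\lambda \cdot \bigl(2\log n + O((\log n)^{4/5})\bigr) \\
     &\le O(n) + \Bigl(\sum_{i=0}^{\lambda-1} (n_{i,<} + 1 + n_{i,=}) + n_\lambda\Bigr)
        \cdot \bigl(2\log n + O((\log n)^{4/5})\bigr) + O(n/\log n) \\
     &\le O(n) + n \cdot \bigl(2\log n + O((\log n)^{4/5})\bigr) + O(n/\log n) \\
     &\le 2n \log n + O\bigl(n (\log n)^{4/5}\bigr).
\end{align*}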

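For the number of moves, we get

\begin{align*}
M(n) &\le \sum_{i=0}^{\lambda-1} \varepsilon \cdot n_i
        + \sum_{i=0}^{\lambda-1} (13+\varepsilon) \cdot n_{i,<}
        + \sum_{i=0}^{\lambda-1} 2n_{i,=}
        + \sum_{i=0}^{\lambda-1} O\bigl(n/(\log n)^2\bigr) \\
     &\quad + (11+\varepsilon) \cdot n_\lambda \\
     &\le \varepsilon \cdot n + \Bigl(\sum_{i=0}^{\lambda-1} (n_{i,<} + 1 + n_{i,=}) + n_\lambda\Bigr)
        \cdot (13+\varepsilon) + O(n/\log n) \\
     &\le \varepsilon \cdot n + n \cdot (13+\varepsilon) + O(n/\log n) \\
     &\le (13+\varepsilon) \cdot n,
\end{align*}

where $\varepsilon > 0$ is an arbitrarily small, but fixed, real constant. The above analysis did
not include the costs of the initial building of pointer memory. However, by Lem. 8, this can be
done with only $O(n)$ comparisons and $O(n/\log n)$ moves, and hence the bounds displayed
above represent the total computational costs of the algorithm.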
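
Two steps in these summations are used without comment; one way to read them is written out below. This is a sketch of the reasoning rather than a quotation of the original proof: the first line uses the bound on the number of iterations from (9), and the second line shows how the constant factors multiplying $\varepsilon$ can be absorbed by renaming $\varepsilon$, for all sufficiently large $n$.

```latex
\begin{align*}
\sum_{i=0}^{\lambda-1} O\!\left(\frac{n}{(\log n)^2}\right)
  &= \lambda \cdot O\!\left(\frac{n}{(\log n)^2}\right)
   \le O(\log n) \cdot O\!\left(\frac{n}{(\log n)^2}\right)
   = O\!\left(\frac{n}{\log n}\right), \\
\sum_{i=0}^{\lambda-1} \varepsilon\, n_i + (13+\varepsilon)\, n + O\!\left(\frac{n}{\log n}\right)
  &< 4\varepsilon\, n + (13+\varepsilon)\, n + \varepsilon\, n
   = (13 + 6\varepsilon)\, n
   \quad \text{for all sufficiently large } n.
\end{align*}
```

Since $\varepsilon$ is arbitrary, replacing it by $\varepsilon/6$ throughout yields a bound of the form $(13+\varepsilon) \cdot n$.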

Theorem 9. The given array, consisting of $n$ elements, can be sorted in-place by performing
at most $2n \log n + o(n \log n)$ comparisons and $(13+\varepsilon) \cdot n$ element moves, where
$\varepsilon > 0$ denotes an arbitrarily small, but fixed, real constant. The number of auxiliary
arithmetic operations with indices is bounded by $O(n \log n)$.

3.3. Handling short blocks. The algorithm presented above needs a procedure capable
of sorting blocks of small lengths, namely, with $m \le 2^{16} = 65536$. This is required, among
other things, to sort blocks $A_<$ that are short. We could sweep the problem under the rug by
saying that "short" blocks can, "somehow," be sorted with $O(1)$ comparisons and moves,
since they are of constant lengths. However, the upper bounds presented by Thm. 7 in
Sect. 2.12 require some more details, especially for $(11+\varepsilon) \cdot m$, the number of
moves. Last but not least, these lengths are important in practice.

One possible simple solution is to use our version of heapsort, with 5 roots and
internal nodes having 5 sons. Using the analysis presented in Sect. 3.1, devoted to building
a pointer memory, for $i = 5$, $m \le 2^{16}$, and hence for at most
$q \le 1 + \lfloor \log_5 m \rfloor \le 7$ levels, one can easily verify that we shall never use
more than $2m \log m + 6.25m$ comparisons or $9.75m$ moves. (These bounds are not tight; we
leave further improvement to the reader.)
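
To make the idea concrete, here is a plain heapsort on a 5-ary heap with a single root. It is a standard textbook variant shown as a sketch, not the multi-root version referred to above, and it makes no attempt to meet the stated bounds on comparisons or moves. The names `heapsort_d_ary` and `levels` are hypothetical; `levels` evaluates $1 + \lfloor \log_5 m \rfloor$ with integer arithmetic.

```python
def heapsort_d_ary(a, d=5):
    """Sort the list `a` in place with a single-root d-ary max-heap."""
    n = len(a)

    def sift_down(root, end):
        # Move a[root] down until no child inside a[:end] is larger.
        item = a[root]
        while True:
            first = d * root + 1
            if first >= end:
                break
            # Pick the largest of the (up to d) children.
            best = max(range(first, min(first + d, end)), key=a.__getitem__)
            if a[best] <= item:
                break
            a[root] = a[best]      # shift the child up instead of swapping
            root = best
        a[root] = item

    for root in range((n - 2) // d, -1, -1):   # from the last internal node down to 0
        sift_down(root, n)
    for end in range(n - 1, 0, -1):
        a[0], a[end] = a[end], a[0]            # current maximum goes to its final place
        sift_down(0, end)
    return a


def levels(m, d=5):
    """Return 1 + floor(log_d m) for m >= 1, avoiding floating-point logarithms."""
    q, power = 1, d
    while power <= m:
        q, power = q + 1, power * d
    return q
```

Shifting children upward along the sift-down path, rather than swapping at every level, is the usual way to keep the number of element moves per level close to one.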

3.4. An alternative solution. As pointed out at the end of Sect. 2.8, devoted to extracting
sorted elements from segments, we could use a heap structure with four levels, instead
of five, in a segment. This slightly reduces the number of moves, but increases the number 
of comparisons. The detailed argument parallels the proof of Thm. 9, and hence it is left to
the reader.

Corollary 10. The given array, consisting of $n$ elements, can be sorted in-place by
performing at most $6n \log n + o(n \log n)$ comparisons and $(12+\varepsilon) \cdot n$ element
moves, where $\varepsilon > 0$ denotes an arbitrarily small, but fixed, real constant.

4. Concluding Remarks 

We have described the first in-place sorting algorithm performing $O(n \log n)$ comparisons
and $O(n)$ element moves in the worst case, which closes a long-standing open problem.

However, the algorithms presented in Thm. 9 and Cor. 10 do not sort stably, since the
order of buffer elements may change. If some elements used in buffers are equal, their
original order cannot be recovered. This leaves us with a fascinating question:

Does there exist an algorithm operating in-place and performing, in the worst
case, at most $O(n \log n)$ comparisons, $O(n)$ moves, $O(n \log n)$ arithmetic
operations, and, at the same time, sorting elements stably, so that the relative order
of equal elements is preserved?

At the present time, we dare not formulate any conjectures about this problem. The
best known algorithm for stable in-place sorting with $O(n)$ moves is still the one presented
in ^2Jj performing $O(n^{1+\varepsilon})$ comparisons in the worst case.

We are also firmly convinced that the upper bounds of Thm. 9 and Cor. 10 are not
optimal and can be improved, which is left as another open problem.